Exercise 11

Author

Dipayan Akhuli (dipayan-akhuli)

Published

December 1, 2023

Set seed

set.seed(42)

Load packages

library(CATALYST)
library(diffcyt)
library(ExperimentHub)
library(readxl)
library(ggplot2)
library(reshape2)

Question 1: Loading data

eh <- ExperimentHub()
(bodenmiller <- query(eh, "Bodenmiller"))
ExperimentHub with 21 records
# snapshotDate(): 2023-04-24
# $dataprovider: NA, Bodenmiller Group
# $species: Homo sapiens
# $rdataclass: flowSet, SummarizedExperiment, data.frame
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["EH2254"]]' 

           title                                           
  EH2254 | Bodenmiller_BCR_XL_SE                           
  EH2255 | Bodenmiller_BCR_XL_flowSet                      
  EH3047 | Weber_BCR_XL_sim_main_SE                        
  EH3048 | Weber_BCR_XL_sim_main_flowSet                   
  EH3049 | Weber_BCR_XL_sim_null_rep1_SE                   
  ...      ...                                             
  EH3061 | Weber_BCR_XL_sim_less_distinct_less_50pc_SE     
  EH3062 | Weber_BCR_XL_sim_less_distinct_less_50pc_flowSet
  EH3063 | Weber_BCR_XL_sim_less_distinct_less_75pc_SE     
  EH3064 | Weber_BCR_XL_sim_less_distinct_less_75pc_flowSet
  EH3564 | Damond Diabetes Spatial Dataset                 

Bodenmiller_BCR_XL_flowSet is associated with query ID EH2255.

(data_fs <- eh[["EH2255"]])
A flowSet with 16 experiments.

column names(39): Time Cell_length ... sample_id population_id
colnames(data_fs)
 [1] "Time"            "Cell_length"     "CD3(110:114)Dd"  "CD45(In115)Dd"  
 [5] "BC1(La139)Dd"    "BC2(Pr141)Dd"    "pNFkB(Nd142)Dd"  "pp38(Nd144)Dd"  
 [9] "CD4(Nd145)Dd"    "BC3(Nd146)Dd"    "CD20(Sm147)Dd"   "CD33(Nd148)Dd"  
[13] "pStat5(Nd150)Dd" "CD123(Eu151)Dd"  "pAkt(Sm152)Dd"   "pStat1(Eu153)Dd"
[17] "pSHP2(Sm154)Dd"  "pZap70(Gd156)Dd" "pStat3(Gd158)Dd" "BC4(Tb159)Dd"   
[21] "CD14(Gd160)Dd"   "pSlp76(Dy164)Dd" "BC5(Ho165)Dd"    "pBtk(Er166)Dd"  
[25] "pPlcg2(Er167)Dd" "pErk(Er168)Dd"   "BC6(Tm169)Dd"    "pLat(Er170)Dd"  
[29] "IgM(Yb171)Dd"    "pS6(Yb172)Dd"    "HLA-DR(Yb174)Dd" "BC7(Lu175)Dd"   
[33] "CD7(Yb176)Dd"    "DNA-1(Ir191)Dd"  "DNA-2(Ir193)Dd"  "group_id"       
[37] "patient_id"      "sample_id"       "population_id"  

Out of the 39 columns, we have a total of 33 recorded markers (with suffix Dd), out of which 9 are neither type nor state markers (as seen later); these are not considered any further.

The flowSet contains 16 experiments, i.e., there are a total of 16 samples.

num_cells <- rep(0, 16)
names(num_cells) <- rownames(phenoData(data_fs))
for (i in 1:16) {
  num_cells[i] <- dim(exprs(data_fs[[i]]))[1]
}
num_cells
   PBMC8_30min_patient1_BCR-XL.fcs PBMC8_30min_patient1_Reference.fcs 
                              2838                               2739 
   PBMC8_30min_patient2_BCR-XL.fcs PBMC8_30min_patient2_Reference.fcs 
                             16675                              16725 
   PBMC8_30min_patient3_BCR-XL.fcs PBMC8_30min_patient3_Reference.fcs 
                             12252                               9434 
   PBMC8_30min_patient4_BCR-XL.fcs PBMC8_30min_patient4_Reference.fcs 
                              8990                               6906 
   PBMC8_30min_patient5_BCR-XL.fcs PBMC8_30min_patient5_Reference.fcs 
                              8543                              11962 
   PBMC8_30min_patient6_BCR-XL.fcs PBMC8_30min_patient6_Reference.fcs 
                              8622                              11038 
   PBMC8_30min_patient7_BCR-XL.fcs PBMC8_30min_patient7_Reference.fcs 
                             14770                              15974 
   PBMC8_30min_patient8_BCR-XL.fcs PBMC8_30min_patient8_Reference.fcs 
                             11653                              13670 

The first pair of samples (from patient 1) have only around 2800 cells each. Except this pair, all other samples have around 7000 to 17000 cells.

Question 2: Constructing a SingleCellExperiment

url <- "https://zenodo.org/records/10039274/files"
fns <- list(panel="PBMC8_panel_v3.xlsx", md="PBMC8_metadata.xlsx")
for (fn in fns) {
  download.file(file.path(url, fn), destfile=fn, mode="wb")
}
data_panel <- read_excel(fns$panel)
data_md <- read_excel(fns$md)
(sce <- prepData(data_fs, data_panel, data_md))
class: SingleCellExperiment 
dim: 24 172791 
metadata(2): experiment_info chs_by_fcs
assays(2): counts exprs
rownames(24): CD3 CD45 ... HLA-DR CD7
rowData names(3): channel_name marker_name marker_class
colnames: NULL
colData names(3): sample_id condition patient_id
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
dim(counts(sce))
[1]     24 172791

There are 24 markers recorded. The total number of cells over all samples is 172791.

Question 3: Type and state markers

type_markers(sce)
 [1] "CD3"    "CD45"   "CD4"    "CD20"   "CD33"   "CD123"  "CD14"   "IgM"   
 [9] "HLA-DR" "CD7"   
state_markers(sce)
 [1] "pNFkB"  "pp38"   "pStat5" "pAkt"   "pStat1" "pSHP2"  "pZap70" "pStat3"
 [9] "pSlp76" "pBtk"   "pPlcg2" "pErk"   "pLat"   "pS6"   
plotExprHeatmap(sce, features="type", row_clust=FALSE)

Figure 1: Heatmap of median type marker expression across samples

Some markers that seem to violate the assumption that expressions are fairly consistent across conditions are CD7 (higher in stimulated), CD4 (higher in reference), HLA-DR (higher in stimulated), and CD20 (higher in reference). Additionally, CD3 and CD14 also show slightly variable expression across samples but not systematically between conditions. The only markers which are consistently expressed similarly across samples and conditions are CD45, CD123, CD33, and IgM.

Question 4: Clustering

set.seed(42)
sce <- cluster(sce, features="type", xdim=8, ydim=8, maxK=10)
# cell counts per meta8 cluster per sample
dcast(data.frame(sample_id=sample_ids(sce), cluster_id=cluster_ids(sce, "meta8")), sample_id~cluster_id)
   sample_id    1   2    3    4    5    6    7   8
1     BCRXL1  763  22  122  612  163  965  141  50
2     BCRXL2 5948 473  860 2354 1367 4933  392 348
3     BCRXL3 3921 645 1558 2218 1108 1747  739 316
4     BCRXL4 2811 595 1165 1663  896 1235  434 191
5     BCRXL5 3162 332  978 1233  636 1575  453 174
6     BCRXL6 3807 499 1096  743  586 1226  419 246
7     BCRXL7 3925 438  678 3825 1931 3014  677 282
8     BCRXL8 3640 444  790 1557 1795 2644  478 305
9       Ref1 1215  15  224  422  157  461  175  70
10      Ref2 7864 291 1834 1619 1674 2431  651 361
11      Ref3 3215 268 1547 1406 1066  712 1029 191
12      Ref4 2034 377 1728 1063  605  616  387  96
13      Ref5 4530 340 1641 1313 2055 1079  733 271
14      Ref6 5146 303  887  733 1892  921  787 369
15      Ref7 4310 333  818 3384 3720 2080 1011 318
16      Ref8 4986 349 1118 1485 2490 2185  720 337
# cell percentages per 8-metacluster per condition
cells_per_cluster <- sweep(dcast(data.frame(condition=colData(sce)$condition, cluster_id=cluster_ids(sce, "meta8")), condition~cluster_id)[,2:9], 1, 0.01*c(sum(n_cells(sce)[1:8]), sum(n_cells(sce)[9:16])), "/")
rownames(cells_per_cluster) <- unique(colData(sce)$condition)
colnames(cells_per_cluster) <- paste0("cluster", 1:8)
cells_per_cluster
      cluster1 cluster2  cluster3 cluster4 cluster5 cluster6 cluster7 cluster8
BCRXL 33.17051 4.088069  8.592296 16.84194 10.05655 20.55772 4.425975 2.266934
Ref   37.64924 2.573263 11.076565 12.91719 15.44297 11.85442 6.210429 2.275914

Clusters 2, 4, and 6 are more frequent in the stimulated samples as compared to the reference samples, while clusters 1, 3, 5 and 7 are less frequent. Cluster 8 is almost equally frequent in both conditions.

Question 5: Dimensionality reduction

set.seed(42)
sce <- runDR(sce, dr="UMAP", cells=500, features="type")
plotDR(sce, color_by="patient_id")

Figure 2: UMAP representation of cells colored by patient ID

We see that cells are more or less spread homogeneously across all patients, so batch effects across patients can be neglected.

plotDR(sce, color_by="condition")

Figure 3: UMAP representation of cells colored by condition

We see a clear rightward shift of the stimulated samples as compared to the reference samples in the right half of the UMAP plot, with a slight upward shift also seen in the left half of the plot.

Question 6: Exploratory data analysis

pbMDS(sce, color_by="patient_id")

Figure 4: Pseudo-bulk MDS plot of samples colored by patient ID

All 8 pairs of samples (each pair collected from the same patient) are well-separated in the first dimension. Additionally, the pair of samples from patient 6 is also well separated in the second dimension.

pbMDS(sce, color_by="condition")

Figure 5: Pseudo-bulk MDS plot of samples colored by condition

All BCRXL-stimulated samples are on the left half of the MDS plot, well separated in the first dimension from the reference samples which are all on the right half of the plot. However, the BCRXL6 sample is somewhat close to the reference samples in the first dimension. Additionally, the reference samples are more spread out within themselves as compared to the BCRXL-stimulated samples.

Question 7: Differential state (DS) analysis

ei <- ei(sce)

# linear mixed model with a random intercept for each patient
ds_formula1 <- createFormula(ei, cols_fixed="condition", cols_random="patient_id")

# regular linear model
ds_formula2 <- createFormula(ei, cols_fixed="condition") 

# create contrasts between conditions
contrast <- createContrast(c(0,1))

# run diffcyt for the two models at the 8-metacluster level
res_ds1 <- diffcyt(sce, ei, formula=ds_formula1, contrast=contrast, analysis_type="DS", method_DS="diffcyt-DS-LMM", clustering_to_use="meta8")

res_ds2 <- diffcyt(sce, ei, formula=ds_formula2, contrast=contrast, analysis_type="DS", method_DS="diffcyt-DS-LMM", clustering_to_use="meta8")

# test results for the two models
(tbl_ds1 <- rowData(res_ds1$res))
DataFrame with 112 rows and 4 columns
    cluster_id marker_id       p_val       p_adj
      <factor>  <factor>   <numeric>   <numeric>
1            1     pNFkB 2.44332e-07 8.55162e-07
2            2     pNFkB 3.10957e-08 1.28990e-07
3            3     pNFkB 2.68674e-14 2.31473e-13
4            4     pNFkB 4.77849e-11 2.81679e-10
5            5     pNFkB 3.28304e-09 1.53209e-08
...        ...       ...         ...         ...
4            4       pS6 3.16311e-05 9.08379e-05
5            5       pS6 1.58638e-03 3.52018e-03
6            6       pS6 2.87609e-07 9.76128e-07
7            7       pS6 0.00000e+00 0.00000e+00
8            8       pS6 1.14738e-03 2.79361e-03
(tbl_ds2 <- rowData(res_ds2$res))
DataFrame with 112 rows and 4 columns
    cluster_id marker_id       p_val       p_adj
      <factor>  <factor>   <numeric>   <numeric>
1            1     pNFkB 1.44285e-04 0.000850520
2            2     pNFkB 9.50763e-04 0.003600878
3            3     pNFkB 2.05407e-04 0.001023640
4            4     pNFkB 1.23368e-05 0.000151000
5            5     pNFkB 3.75724e-05 0.000350676
...        ...       ...         ...         ...
4            4       pS6 1.20918e-03 4.23214e-03
5            5       pS6 1.07569e-02 3.34660e-02
6            6       pS6 5.75568e-04 2.47937e-03
7            7       pS6 3.68328e-12 4.12527e-10
8            8       pS6 1.00081e-02 3.20259e-02

There are a total of 8*14 = 112 tests, since there are 8 metaclusters and 14 state markers. Each test is to identify whether the given state marker is expressed differentially across conditions in the given metacluster of cells.

threshold <- 0.05

# number of DS findings for the two models at FDR <= 0.05
table(tbl_ds1$p_adj<=threshold)

FALSE  TRUE 
   28    84 
table(tbl_ds2$p_adj<=threshold)

FALSE  TRUE 
   70    42 

When we account for patient variability, we identify many more DS markers (84 as compared to 42 without), thus, the sensitivity is doubled here.

Question 8: Visualizing DS results

plotDiffHeatmap(sce, tbl_ds1, k="meta8", top_n=50, fdr=0.05)

Figure 6: Heatmap of z-normalized expression of the top 50 DS markers across samples

We observe the expression of the top 50 DS markers (ordered by FDR) across samples. We indeed see that the markers are expressed more in a certain cluster for the stimulated samples as compared to the reference samples (like pPlcg2 in cluster 7) and vice versa (like pBtk in cluster 1).

We now visualize a UMAP representation of the clusters at the 8-metacluster level to identify where each cluster lies, and then plot UMAPs of the specific clusters where a given marker is differentially expressed (colored by scaled expression and split by condition) for the top 10 hits.

plotDR(sce, color_by="meta8")

Figure 7: UMAP representation of clusters at the 8-metacluster level
top_hits <- topTable(res_ds1, top_n=50)

plots <- vector("list", 10)
for (i in 1:10) {
  print(plotDR(sce[,cluster_ids(sce, "meta8") == top_hits$cluster_id[i]], color_by=as.character(top_hits$marker_id[i]), facet_by="condition") +
  labs(title=paste0(top_hits$marker_id[i], "(", top_hits$cluster_id[i], ")"), x="UMAP1", y="UMAP2") +
  theme(axis.text =element_blank()))
}

We see that the differential expression of the respective DS markers is quite clear in the displayed clusters. For example, the pBtk marker is consistently expressed lower in the stimulated samples as compared to the reference samples in clusters 1, 3, 4, 6, 7, and 8, while the pPlcg2 and pS6 markers are expressed higher in the stimulated samples as compared to the reference samples in clusters 7 and 3, respectively.


Runtime in seconds:

paste(signif((proc.time() - startTime)["elapsed"], digits=4), "s")
[1] "34.04 s"